
Parallelize Import Process #43

Open
vjekoslav wants to merge 21 commits into senko:main from vjekoslav:import-parallel

Conversation

@vjekoslav
Contributor

This pull request aims to significantly improve performance and reduce the time required to import large datasets. The changes primarily affect the import logic and related modules.

The previous sequential import process was a bottleneck for large data files. By parallelizing the workload, we can leverage multi-core CPUs and achieve much faster import times, making the system more scalable and responsive. There are other possible approaches as well: CSV imports, copying via a temporary table, code optimizations, avoiding race conditions, and probably more.

Key Changes

  • Refactored import logic to support parallel execution (using threads/processes as appropriate).
  • Refactored and updated relevant modules in service/db/import.py and service/db/importer/ to enable concurrent processing of data chunks.
  • Database primary key (PK) changes.

Daily import speed comparison

On a MacBook Pro M2:
Before optimization: ~350 seconds for 20 stores
After optimization: ~100 seconds for 20 stores

(Screenshots: import timing before and after optimization)

I've tested the data for consistency with the previous import. Please review that part as well.

I've tried extracting anchor_price into a separate table, since that value never changes and is identical in every import; currently it is duplicated per row per day. However, whatever I tried slowed down the import process, so I've postponed that optimization for later. It would shrink the prices table, and the anchor_price insert should be skippable ~99% of the time (see the sketch below).
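A rough sketch of what that postponed normalization could look like, using sqlite3 purely for illustration; the table and column names are hypothetical and not the PR's actual schema. Because existing rows are skipped, repeated daily imports would turn almost all anchor_price inserts into no-ops.

```python
# Hypothetical sketch of the postponed anchor_price normalization.
# Table/column names are illustrative, not the project's real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE anchor_prices (
        product_id   INTEGER PRIMARY KEY,
        anchor_price INTEGER NOT NULL
    );
""")

def upsert_anchor_prices(rows: list[tuple[int, int]]) -> None:
    """Insert (product_id, anchor_price) pairs; rows that already exist are
    skipped, so on a typical daily import ~99% of these become no-ops."""
    conn.executemany(
        "INSERT INTO anchor_prices (product_id, anchor_price) VALUES (?, ?) "
        "ON CONFLICT (product_id) DO NOTHING",
        rows,
    )
    conn.commit()

upsert_anchor_prices([(1, 999), (2, 1490)])
upsert_anchor_prices([(1, 999), (3, 250)])  # only product 3 is actually written
```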

Vjeko Nikolic and others added 21 commits July 14, 2025 14:57
…SV price processing

The db object is properly initialized from settings.get_db()
@senko
Owner

senko commented Jul 22, 2025

Thanks for the PR, but please break this up into several small independent ones, because reviewing 1,300 lines of changes with a pile of refactoring isn't exactly pleasant.

I would also appreciate it if we agreed on refactors and/or database changes in an issue first, because otherwise there's a risk that I get a PR you've put a lot of effort into while our visions don't match.

Specifically for the parallelization itself, I think it's enough not to await each call but to collect the futures in an array and then await them all, i.e. a 2-3 line change (plus another ten or so lines if we want it as a non-default option).
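A minimal sketch of the change being suggested, assuming the import loop is asyncio-based; process_store() and the store list are hypothetical placeholders, not the project's actual API.

```python
import asyncio

async def process_store(store: str) -> None:
    # stand-in for the real per-store import work
    await asyncio.sleep(0.1)

async def import_all(stores: list[str], parallel: bool = False) -> None:
    if not parallel:
        # current behavior: await each store in turn
        for store in stores:
            await process_store(store)
    else:
        # suggested change: collect the coroutines and await them all at once
        await asyncio.gather(*(process_store(store) for store in stores))

asyncio.run(import_all(["store-a", "store-b"], parallel=True))
```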

@vjekoslav
Contributor Author

You're right, this was partly exploratory, to see what could be done.

What I found was a race condition in EAN processing when running in parallel, more precisely deadlocks at the database level. I think the problem was that an EAN code wouldn't yet exist in the database while we were adding a product, or something along those lines. So there are two phases here: the first is sequential processing of all EANs into a dictionary shared by all the parallel workers, and the second phase is parallel processing of prices. EANs are mutated only in the first phase.
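A rough illustration of that two-phase structure, assuming asyncio and with stand-in names (resolve_ean, import_price_chunk, the row shape); this is a sketch of the described approach, not the actual PR code.

```python
import asyncio

async def resolve_ean(ean: str) -> int:
    # stand-in for looking up / inserting the EAN and returning its product id
    await asyncio.sleep(0)
    return hash(ean) % 1000

async def import_price_chunk(chunk: list[dict], ean_to_id: dict[str, int]) -> None:
    # stand-in for inserting price rows; only reads the shared dict
    await asyncio.sleep(0)

async def run_import(rows: list[dict]) -> None:
    # Phase 1: sequential, the only place the EAN mapping is mutated
    ean_to_id: dict[str, int] = {}
    for row in rows:
        if row["ean"] not in ean_to_id:
            ean_to_id[row["ean"]] = await resolve_ean(row["ean"])

    # Phase 2: price processing in parallel chunks, sharing the read-only dict
    chunk_size = 500
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    await asyncio.gather(*(import_price_chunk(c, ean_to_id) for c in chunks))

asyncio.run(run_import([{"ean": "3850001234567", "price": 199}]))
```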

I also tried DB locks and semaphores, but things got too complicated over time and the code became overly complex.

I've somewhat forgotten the exact details of the problem.

BTW, I tried to move away from import.py because import is a reserved word in Python, so it causes problems when importing. The importer package mostly solves that, since the files can now be imported.

I'll try to break this up into independent PRs.
